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Abstract 

Independent component analysis (ICA) is the problem of efficiently recovering a matrix 
A € from i.i.d. observations of A = AS where S € MA is a random vector with mutually 

independent coordinates. This problem has been intensively studied, but all existing efficient 
algorithms with provable guarantees require that the coordinates Si have finite fourth moments. 
We consider the heavy-tailed ICA problem where we do not make this assumption, about the 
second moment. This problem also has received considerable attention in the applied literature. 
In the present work, we first give a provably efficient algorithm that works under the assumption 
that for constant 7 > 0 , each Si has finite (1 -|- 7 )-moment, thus substantially weakening the 
moment requirement condition for the ICA problem to be solvable. We then give an algorithm 
that works under the assumption that matrix A has orthogonal columns but requires no moment 
assumptions. Our techniques draw ideas from convex geometry and exploit standard properties 
of the multivariate spherical Gaussian distribution in a novel way. 


1 Introduction 

The blind source separation problem is the general problem of recovering underlying “source signals” 
that have been mixed in some unknown way and are presented to an observer. Independent compo¬ 
nent analysis (ICA) is a popular model for blind source separation where the mixing is performed 
linearly. Formally, if S is an n-dimensional random vector from an unknown product distribution 
and A is an invertible linear transformation, one is tasked with recovering the matrix A and the 
signal S, using only access to i.i.d. samples of the transformed signal, namely X = AS. Due to 
natural ambiguities, the recovery of A is possible only up to the signs and permutations of the 
columns. Moreover, for the recovery to be possible the distributions of the random variables Si 
must not be a Gaussian distribution. ICA has applications in diverse areas such as neuroscience, 
signal processing, statistics, machine learning. There is vast literature on ICA; see, e.g., PEI [3]. 

Since the formulation of the ICA model, a large number of algorithms have been devised em¬ 
ploying a diverse set of techniques. Many of these existing algorithms break the problem into 
two phases: first, find a transformation which, when applied to the observed samples, gives a new 
distribution which is isotropic, i.e. a rotation of a (centered) product distribution; second, one 
typically uses an optimization procedure for a functional applied to the samples, such as the fourth 
directional moment, to recover the axes (or basis) of this product distribution. 

’andejoseOcse. ohio-state. edu , Dept, of Computer Science and Engineering, Ohio State University 
Gavingo@microsoft.com, Microsoft Research 

Gandi.10@osu.edu, Dept, of Computer Science and Engineering, Ohio State University 
Grademac@cse.ohio-state.edu, Dept, of Computer Science and Engineering, Ohio State University 


1 



Our focus in this paper will be on efficient algorithms with provable guarantees and finite sample 
analysis. To our knowledge, all known efficient algorithms for ICA with provable guarantees require 
higher moment assumptions such as finiteness of the fourth or higher moments for each component 
Si- Some of the most relevant works, e.g. algorithms of lais], explicitly require the fourth moment 
to be finite. Algorithms in [elE], which make use of the characteristic function also seem to require 
at least the fourth moment to be hnite: while the characteristic function exists for distributions 
without moments, the algorithms in these papers use the second or higher derivatives of the (second) 
characteristic function, and for this to be well-defined one needs the moments of that order to exist. 
Furthermore, certain anticoncentration properties of these derivatives are needed which require that 
fourth or higher moments exist. 

Thus the following question arises: is ICA provably efficiently solvable when the moment con¬ 
dition is weakened so that, say, only the second moment exists, or even when no moments exist? 
By heavy-tailed ICA we mean the ICA problem with weak or no moment conditions (the precise 
moment conditions will be specified when needed). 

While we consider this problem to be interesting in its own right, it is also of interest in practice 
in a range of applications, e.g. [HI [g dm El [Ill la Hi]. The problem could also be interesting 
from the perspective of robust statistics because of the following informal connection: algorithms 
solving heavy-tailed ICA might work by focusing on samples in a small (but not low-probability) 
region in order to get reliable statistics about the data and ignore the long tail. Thus if the data 
for ICA is corrupted by outliers, the outliers are less likely to affect such an algorithm. 

In this paper, heavy-tailed distributions on the real line are those for which low order moments 
are not finite. Specifically, we will be interested in the case when the fourth or lower order moments 
are not finite as this is the case that is not covered by previous algorithms. We hasten to clarify that 
in some ICA literature the word heavy-tailed is used with a different and less standard meaning, 
namely distributions with positive kurtosis; this meaning will not be used in the present work. 

Heavy-tailed distributions arise in a wide variety of contexts including signal processing and 
finance; see [laile] for an extensive bibliography. Some of the prominent examples of heavy-tailed 
distributions are the Pareto distribution with shape parameter a which has moments of order less 
than a, the Cauchy distributions, which has moments of order less than 1; many more examples 
can be found on the Wikipedia page for heavy-tailed distributions. An abundant (and important 
in applications) supply of heavy-tailed distributions comes from stable distributions; see, e.g., [T5] . 
There is also some theoretical work on learning mixtures of heavy-tailed distributions, e.g., [igiiH]. 

In several applied ICA models with heavy tails it is reasonable to assume that the distributions 
have finite first moment. In applications to finance (e.g., m), heavy tailed distributions are 
commonly used to model catastrophic but somewhat unlikely scenarios. A standard measures of 
risk in that literature, the so called conditional value at risk [ 20 ], is only finite when the first moment 
is finite. Therefore, it is reasonable to assume for some of our results that the distributions have 
finite first moment. 

1.1 Our results 

Our main result is an efficient algorithm that can recover the independent components when each 
Si has 1 -|- 7 moments for 7 a positive constant. The following theorem states more precisely the 
guarantees of our algorithm. The theorem below refers to the algorithm Fourier PCA [] which 
solves ICA under the fourth moment assumption. The main reason to use this algorithm is that 
finite sample guarantees have been proved for it; we could have plugged in any other algorithm 
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with such guarantee. The theorem below also refers to Gaussian damping, which is an algorithmic 
technique we introduce in this paper and will be explained shortly. 

Theorem 1 (Heavy-tailed ICA). Let X = AS he an ICA model such that the distribution of S is 
absolutely continuous, for all i we have E(|S'j|^+'>') < M < oo and normalized so that ElS'jl = 1, 
and the columns of A have unit norm. Let A > 0 be such that for each i G [n] if Si has finite 
fourth moment then its fourth cumulant satisfies |cum 4 (S'j)| > A. Then, given 0 < e < n^, > 0, 

sm > c’'max(^); Sm < cJ min (A), Algorithm\^ combined with Gaussian damping and Fourier PGA 
outputs bi,...,bn G M” such that there are signs ai G {—1,1} and a permutation vr : [n] —>• [n] 
satisfying ||Aj — II ^ poly..^(n, M, l/sm, sm; 1/A, i?, 1/i?, 1/e, 1/(5) time and sample 

complexity and with probability at least 1 — 5. Here R is a parameter of the distributions of the Si 
as described below. The degree of the polynomial is 0 ( 1 / 7 ). 

(The assumption that S has an absolutely continuous distribution is mostly for convenience in 
the analysis of Gaussian damping and not essential. In particular, it is not used in Algorithm [T]) 

Intuitively, R in the theorem statement above measures how large a ball we need to restrict the 
distribution to, which has at least a constant (actually l/poly(n) suffices) probability mass and, 
moreover, each Si when restricted to the interval [—i?, i2] has fourth cumulant at least H(A). We 
show that all sufficiently large R satisfy the above conditions and we can efficiently compute such 
an R; see the discussion after Theorem [2] (the restatement in Sec. [7]). For standard heavy-tailed 
distributions, such as the Pareto distribution, R behaves nicely. For example, consider the Pareto 
distribution with shape parameter = 2 and scale parameter = 1, i.e. the distribution with density 
2/t^ for t > 1 and 0 otherwise. For this distribution it’s easily seen that R = H(A^/^) suffices for 
the cumulant condition to be satisfied. 

Theorem [1] requires that the (1 -|- 7 )-moment of the components Si be finite. However, if the 
matrix A in the ICA model is unitary (i.e. A^A = J, or in other words, A is a rotation matrix) 
then we do not need any moment assumptions: 

Theorem 2. Let X = AS be an IGA model such that A G is unitary (i.e., A = I) and 

the distribution of S is absolutely continuous. Let A > 0 6e such that for each i G [n] if Si has 
finite fourth moment then |cum 4 (iS'j)| > A. Then, given €,5 > 0, Gaussian damping combined with 
Fourier PGA outputs 61 ,..., G M" such that there are signs ai G {—1,1} and a permutation 
TT : [re] —)■ [re] satisfying \\Ai — Q;j6^(j)|| < e, in poIy(re, i?, 1/A, 1/e, 1/(5) time and sample complexity 
and with probability at least 1 — 5. Here R is a parameter of the distributions of the Si as described 
above. 

Idea of the algorithm. Like many ICA algorithms, our algorithm has two phases: first orthog- 
onalize the independent components (reduce to the pure rotation case), and then determine the 
rotation. In our heavy-tailed setting, each of these phases requires a novel approach and analysis 
in the heavy-tailed setting. 

A standard orthogonalization algorithm is to put X in isotropic position using the covariance 
matrix Cov(A) := E(AA^). This approach requires finite second moment of X, which, in our 
setting, is not necessarily finite. Our orthogonalization algorithm (Section [6|) only needs finite 
(1 -|- 7 )-absolute moment and that each Si is symmetrically distributed. (The symmetry condition 
is not needed for our ICA algorithm, as one can reduce the general case to the symmetric case, see 
Section [ 9 ]). In order to understand the first absolute moment, it is helpful to look at certain convex 
bodies induced by the first and second moment. The directional second moment Ex(('U^A)^) is 
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a quadratic form in u and its square root is the support function of a convex body, Legendre’s 
inertia ellipsoid, up to some scaling factor (see |21j for example). Similarly, one can show that 
the directional absolute first moment is the support function of a convex body, the centroid body. 
When the signals Si are symmetrically distributed, the centroid body of X inherits these symmetries 
making it absolutely symmetric (see Section[2]for definitions) up to an affine transformation. In this 
case, a linear transformation that puts the centroid body in isotropic position also orthogonalizes 
the independent components (Lemma [T9l) . In summary, the orthogonalization algorithm is the 
following: find a linear transformation that puts the centroid body of X in isotropic position. One 
such matrix is given by the inverse of the square root of the covariance matrix of the uniform 
distribution in the centroid body. Then apply that transformation to X to orthogonalize the 
independent components. 

We now discuss how to determine the rotation (the second phase of our algorithm). The 
main idea is to reduce heavy-tailed case to a case where all moments exist and to use an existing 
ICA algorithm (from [7] in our case) to handle the resulting ICA instance. We use Gaussian 
damping to achieve such a reduction. By Gaussian damping we mean to multiply the density of 
the orthogonalized ICA model by a spherical Gaussian density. 

We elaborate now on our contributions that make the algorithm possible. 

Centroid body and orthogonalization. The centroid body of a compact set was first defined in 
m- It is defined as the convex set whose support function equals the directional absolute first 
moment of the given compact set. We generalize the notion of centroid body to any probability 
measure having finite first moment (see Section [2] for the background on convexity and Section 
[3] for our formal definition of the centroid body for probability measures). In order to put the 
centroid body in approximate isotropic position, we estimate its covariance matrix. For this, we 
use uniformly random samples from the centroid body. There are known methods to generate 
approximately random points from a convex body given by a membership oracle. We implement 
an efficient membership oracle for the centroid body of a probability measure with 1 + 7 moments. 
The implementation works by first implementing a membership oracle for the polar of the centroid 
body via sampling and then using it via the ellipsoid method (see |23j ) to construct a membership 
oracle for the centroid body. As far as we know this is the first use of the centroid body as an 
algorithmic tool. 

An alternative approach to orthogonalization in ICA one might consider is to use the empirical 
covariance matrix of X even when the distribution is heavy-tailed. A specific problem with this 
approach is that when the second moment does not exist, the diagonal entries would be very 
different and grow without bound. This problem gets worse when one collects more samples. This 
wide range of diagonal values makes the second phase of an ICA algorithm very unstable. 

Linear equivariance and high symmetry. A fundamental property of the centroid body, for 
our analysis, is that the centroid body is linearly equivariant, that is, if one applies an invertible 
linear transformation to a probability measure then the corresponding centroid body transforms 
in the same way (already observed in [ 22 ] ). In a sense that we make precise (Lemma [T 8 |) , high 
symmetry and linear equivariance of an object defined from a given probability measure are sufficient 
conditions to construct from such object a matrix that orthogonalizes the independent components 
of a given ICA model. This is another way to see the connection between the centroid body and 
Legendre’s ellipsoid of inertia for our purposes: Legendre’s ellipsoid of inertia of a distribution is 
linearly equivariant and has the required symmetries. 

Gaussian damping. Here we confine ourselves to the special case of ICA when the ICA matrix 
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is unitary, that is A = I. A natural idea to deal with heavy-tailed distributions is to truncate 
the distribution in far away regions and hope that the truncated distribution still gives us a way 
to extract information. In our setting, this could mean, for example, that we consider the random 
variable obtained from X conditioned on the even that X lies in the ball of radius R centered at the 
origin. Instead of the ball we could restrict to other sets. Unfortunately, in general the resulting 
random variable does not come from an ICA model (i.e., does not have independent components 
in any basis). Nevertheless one may still be able to use this random variable for recovering A. We 
do not know how to get an algorithmic handle on it even in the case of unitary A. Intuitively, 
restricting to a set breaks the product structure of the distribution that is crucial for recovering 
the independent components. 

We give a novel technique to solve heavy-tailed ICA for unitary A. No moment assumptions on 
the components are needed for our technique. We call this technique Gaussian damping. Gaussian 
damping can also be thought of as restriction, but instead of being restriction to a set it is a 
“restriction to a spherical Gaussian distribution.” Let us explain. Suppose we have a distribution 
on M” with density px (•) ■ If we restrict this distribution to a set A (which we assume to be nice: full¬ 
dimensional and without any measure theoretic issues) then the density of the restricted distribution 
is 0 outside A and is proportional to px{x) for x G A. One can also think of the density of the 
restriction as being proportional to the product of px{x) and the density of the uniform distribution 
on A. In the same vein, by restriction to the Gaussian distribution with density proportional to 
e“ll*ll A we simply mean the distribution with density proportional to px{x) A . in other 

words, the density of the restriction is obtained by multiplying the two densities. By Gaussian 
damping of a distribution we mean the distribution obtained by this operation. 

Gaussian damping provides a tool to solve the IGA problem for unitary A by virtue of the 
following properties: (1) The damped distribution has finite moments of all orders. This is an 
easy consequence of the fact that Gaussian density decreases super-polynomially. More precisely, 
one dimensional moment of order d given by the integral dt is hnite for all d > 

0 for any distribution. (2) Gaussian damping retains the product structure. Here we use the 
property of spherical Gaussians that it’s the (unique) class of spherically symmetric distributions 
with independent components, i.e., the density factors: e^^W 1^ = (we are 

hiding a normalizing constant factor). Hence the damped density also factors when expressed in 
terms of the components of s = A~^x (again ignoring normalizing constant factors): 

px{x)eA^\\‘^/^^ = • • • psASn)e~^^^^\ 

Thus we have converted our heavy-tailed ICA model X = AS into another ICA model Xji = ASr 
where Xr and Sr are obtained by Gaussian damping of X and S, resp. To this new model we can 
apply the existing IGA algorithms which require at least the fourth moment to exist. This allows 
us to estimate matrix A (up to signs and permutations of the columns). 

It remains to explain how we get access to the damped random variable Xr. This is done by 
a simple rejection sampling procedure. Damping does not come free and one has to pay for it in 
terms of higher sample and computational complexity, but this increase in complexity is mild in the 
sense that the dependence on various parameters of the problem is still of similar nature as for the 
non-heavy-tailed case. A new parameter R is introduced here which parameterizes the Gaussian 
distribution used for damping. We explained the intuitive meaning of R after the statement of 
Theorem [T] and that it’s a well-behaved quantity for standard distributions. 

Gaussian damping as contrast function. Another way to view Gaussian damping is in terms of 
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contrast functions, a general idea that in particular has been used fruitfully in the ICA literature. 
Briefly, given a function / : M —)■ R, for u on the unit sphere in R"", we compute g{u) := Kf{u'^X). 
Now the properties of the function g{u) as u varies over the unit sphere, such as its local extrema, 
can help us infer properties of the underlying distribution. In particular, one can solve the ICA 
problem for appropriately chosen contrast function f. In ICA, algorithms with provable guarantees 
use contrast functions such as moments or cumulants (e.g., 015]). Many other contrast functions 
are also used. Gaussian damping furnishes a novel class of contrast functions that also leads to 
provable guarantees. E.g., the function given by f{t) = is in this class. We do not use 

the contrast function view in this paper. 

Previous work related to damping. To our knowledge the damping technique, and more generally 
the idea of reweighting the data, is new in the context of ICA. But the general idea of reweighting is 
not new in other somewhat related contexts as we now discuss. In robust statistics (see, e.g., ISU), 
reweighting idea is used for outlier removal by giving less weight to far away data points. Apart from 
this high-level similarity we are not aware of any closer connections to our setting; in particular, 
the weights used are generally different. 

Another related work is [25], on isotropic PCA, afline invariant clustering, and learning mixtures 
of Gaussians. This work uses Gaussian reweighting. However, again we are unaware of any more 
specihc connection to our problem. 

Finally, in PIT] a different reweighting, using a “Fourier weight” ^ (here u G R” is a hxed 
vector and x G R"' is a data point) is used in the computation of the covariance matrix. This 
covariance matrix is useful for solving the IGA problem. But as discussed before, results here do 
not seem to be amenable to our heavy-tailed setting. 

Organization. After preliminaries in the next section, in Sec. [3] we dehne the centroid body 
and prove some useful properties. In Sec. [5] we show how to construct the membership oracle for 
the centroid body of a distribution produced by a symmetric ICA model. In Sec. [6] we use this 
membership oracle to compute the covariance matrix of the uniform distribution on the centroid 
body and using this matrix we orthogonalize the independent components. Finally, in Sec. [9] we 
show why working with symmetric ICA model is without loss of generality. 

2 Preliminaries 

This section contains some basic notation, definitions, and results from previous work. 

We will denote random variables by capital letters, e.g. X, S, and the values they might take by 
corresponding lower case letters, e.g., x, s. For a random variable X, let Rjc denote the probability 
measure induced by X. We take all vectors to be column vectors, and for a vector x, by ||a;|| we 
mean ||x|| 2 . For two vectors G R” we let {x,y) = x^y denote their inner product. For n G N, 
let [n] denote the set of integers 1,... , re. The symbol Bp stands for the re-dimensional unit ^p-ball 
and for the ^ 2 -unit sphere in R”. 

An re-dimensional convex body is a compact convex subset of R"" with non-empty interior. We 
say a convex body K C R^ is absolutely symmetric if (xi,..., x„) £ K 4^ (±®i) • • •, £ K. 

Similarly, we say random variable X (and its distribution) is absolutely symmetric if, for any choice 
of signs ctj G {—1,1}, {xi, ..., Xn) has the same distribution as (aixi,..., a^x^). We say that an re- 
dimensional random vector X is symmetric if X has the same distribution as —X. Note that if X is 
symmetric with independent components (mutually independent coordinates) then its components 
are also symmetric. 
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Let LC C be a non-empty set. The set K° := {x G M” : {x,y) < 1 Vy G K} is called the 
polar of K. The support function of K is hx : M"' —M and defined by hxiO) = sup^-g^ {x, 0). The 
radial function of K is rx ■ K”' \ {0} —?> M and defined by rxiO) = sup {a G M : G iL}. For 

a random variable X G M, let mi{X) = EX* be its zth moment. For each positive integer i the 
cumulant of order i of r.v. X, denoted cumj(X), is a polynomial involving the moments of X. The 
fourth cumulant of X is equal to m 4 (X) —4m3(X)mi(X) —3m2(X)^-|-12m2(X)mi(X)^ —6mi(X)'^. 
When X is symmetric, this simplifies to m 4 (X) — 3m2(X)^. Cumulants can be thought of as a 
higher degree generalization of variance, and they have some nice properties that make them useful 
in ICA, e.g, if X and Y are independent random variables then cum 4 (X-|-y) = cum 4 (X)-|-cum 4 (y). 
Another useful property is that if X is a Gaussian random variable then cum 4 (X) = 0. For this 
reason, the absolute value of the fourth cumulant is often used to quantify the distance of the 
distribution of a random variable from the set of Gaussian distributions. 

Definition 3 ((Symmetric) ICA model). Let A G be an invertible matrix and let S G M** 

he a random vector whose coordinates Si are mutually independent. We then say that the random 
vector X = AS is given by an ICA model. If, in addition, S is symmetric (equivalently, using the 
independence of the components St, each component St is symmetric) then we say that the random 
vector X = AS is given by a symmetric ICA model. 

For X given by a symmetric ICA model we say the matrix B is an orthogonalizer of X if the 
columns of BA are orthogonal. 

We say that a matrix A G is unitary if A^A = I, or in other words A is a rotation matrix. 

(Normally for matrices with real-valued entries such matrices are called orthogonal matrices and 
the word unitary is reserved for their complex counterparts, however the word orthogonal matrix 
can lead to confusion in the present paper.) For matrix C G denote by ||C '||2 the spectral 

norm and by HCH^ the Frobenius norm. We will need the following inequality about the stability 
of matrix inversion (see for example |26l Chapter III, Theorem 2.5]). 

Lemma 4. Let H-H be a matrix norm such that ||Ai?|| < ||A||||i?||. Let matrices C,E G be 

such that ||C'“^Fi ||2 < 1, and let C = C + E. Then 

-i-\\c-^EW 

This implies that if ||F1||2 = \\C — C \\2 < 1/(2||C“^ II 2 ), then 

\\C-^ -C-%<2\\C-^f2m\2 

2.1 Results from convex optimization 

We need the following result: Given a membership oracle for a convex body K one can implement 
efficiently a membership oracle for K°, the polar of K. This follows from applications of the 
ellipsoid method from |23] . Specihcally, we use the following facts; (1) a validity oracle for K can 
be constructed from a membership oracle for K [231 Theorem 4.3.2]; (2) a membership oracle for 
K° can be constructed from a validity oracle for K [231 Theorem 4.4.1]. 

The definitions and theorems in this section all come (occasionally with slight rephrasing) from 
|23j except for the notion of (e, (5)-weak oracle. [The dehnitions below use rational numbers instead 


( 1 ) 
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of real numbers. This is done in [23] as they work out in detail the important low level issues of how 
the numbers in the algorithm are represented as general real numbers cannot be directly handled by 
computers. These low-level details can also be worked out for the arguments in this paper, but as is 
customary, we will not describe these and use real numbers for the sake of exposition.] In this section 
K C is a convex body. For y G M”, the distance oiy io K is given by d{y, K) := min^gj^ \\y — z\\. 
Define S{K,e) := {y G M” : d{y,K) < e} and S{K,—e) := {y G M” : S{y,e) C K}. Let Q denote 
the set of rational numbers. 

Definition 5 ( [23] ) . The e-weak membership problem for K is the following: Given a point y G Q” 
and a rational number e > 0, either (i) assert that y G 5(iL, e), or (ii) assert that y 0 S{K,—e). 
An e-weak membership oracle for K is an oracle that solves the weak membership problem for K. 
For 5 G [0,1], an (e, 5)-weak membership oracle for K acts as follows: Given a point y G Q”, with 
probability at least 1 — 6 it solves the e-weak membership problem for y, K, and otherwise its output 
can be arbitrary. 

Definition 6 l|23ji. The e-weak validity problem for K is the following: Given a vector c G Q”, 
a rational number 7 , and a rational number e > 0 , either (i) assert that c^x < 7 -|- e for all 
X G S{K, —e), or (ii) assert that c^x > 7 — e for some x G S{K,e). The notion of e-weak validity 
oracle and {e, 5)-weak validity oracle can be defined similarly to De/. O 

Definition 7 f |231 Section 2.1]). We say that an oracle algorithm is an oracle-polynomial time 
algorithm for a certain problem defined on a class of convex sets if the running time of the algorithm 
is bounded by a polynomial in the encoding length of K and in the encoding length of the possibly 
existing further input, for every convex set K in the given class. 

The encoding length of a convex set will be specified below, depending on how the convex set 
is presented. 

Theorem 8 (Theorem 4.3.2 in [23]). Let R > r > 0 and qq £ There exists an oracle-polynomial 
time algorithm that solves the weak validity problem for every convex body K C M" contained in 
the ball of radius R and containing a ball of radius r centered at oq given by a weak membership 
oracle. The encoding length of K is n plus the length of the binary encoding of R,r, and oq. 

We remark that the Theorem 4.3.2 as stated in [23] is stronger than the above statement in that 
it constructs a weak violation oracle (not defined here) which gives a weak validity oracle which 
suffices for us. The algorithm given by Theorem 4.3.2 makes a polynomial (in the encoding length 
of K) number of queries to the weak membership oracle. 

Lemma 9 (Lemma 4.4.1 in [23]). There exists an oracle-polynomial time algorithm that solves the 
weak membership problem for K°, where K is a convex body contained in the ball of radius R and 
containing a ball of radius r centered at 0 given by a weak validity oracle. The encoding length of 
K is n plus the length of the binary encoding of R and r. 

Our algorithms and proofs will need more quantitative details from the proofs of the above 
theorem and lemma. These will be mentioned when we need them. 

2.2 Results from algorithmic convexity 

We state here a special case of a standard result in algorithmic convexity: There is an efficient 
algorithm to estimate the covariance matrix of the uniform distribution in a centrally symmetric 


convex body given by a weak membership oracle. The result follows from the random walk-based 
algorithms to generate approximately uniformly random points from a convex body [271 Eg [29]. 
Most papers use access to a membership oracle for the given convex body. In this paper we only 
have access to an e-weak membership oracle. As discussed in m Section 6, Remark 2], essentially 
the same algorithm implements efficient sampling when given an e-weak membership oracle. The 
problem of estimating the covariance matrix of a convex body was introduced in [281 Section 5.2]. 
That paper analyzes the estimation from random points. There has been a sequence of papers 
studying the sample complexity of this problem [301 EH EH ESI EH ESI EH EZ] . 

Theorem 10. Let K C M”' he a centrally symmetric convex body given by a weak membership 
oracle so that rB^ Q K Li RB^. Let T, = Cov(A'). Then there exists a randomized algorithm that, 
when given access to the weak membership of K and inputs r, R, 6, Cc > 0, it outputs a matrix S 
such that with probability at least 1 — 5 over the randomness of the algorithm, we have 

iyu G M") (1 — ec)u^Jlu < u^flu < (1 + €c)u^T,u. (3) 

The running time of the algorithm is poly(n, log(i?/r), l/cc, log(l/(I)). 

Note that ([31) implies jjS — S ||2 < ec||S|| 2 . The guarantee in ([31) has the advantage of being 
invariant under linear transformations in the following sense: if one applies an invertible linear 
transformation C to the underlying convex body, the covariance matrix and its estimate become 
C'EC'^ and respectively. These matrices satisfy 

(Vu G M”) (1 - ec)u^CJ:C^u < u^CtC^u < (1 + 

This fact will be used later. 

3 The centroid body 

The main tool in our orthogonalization algorithm is the centroid body of a distribution, which we 
use as a first moment analogue of the covariance matrix. In convex geometry, the centroid body is 
a standard (see, e.g., [221EI1EEI) convex body associated to (the uniform distribution on) a given 
convex body. Here we use a generalization of the definition from the case of the uniform distribution 
on a convex body to more general probability measures. Let X G M” be a random vector with 
finite first moment, that is, for all u G M” we have E(|(u, X)|) < oo. Following [22], consider the 
function h{u) = E(|(tt, A)|). Then it is easy to see that /i(0) = 0, /i is positively homogeneous, and 
h is subadditive. Therefore, it is the support function of a compact convex set [22l Section 3], [39l 
Theorem 1.7.1], [381 Section 0.6]. This justifies the following definition: 

Definition 11 (Centroid body). Let X G M"" he a random vector with finite first moment, that is, 
for all u we have E(j(u, A)j) < oo. The centroid body of X is the compact convex set, denoted 
TX, whose support function is hrxiu) = E(j(u, A)j). For a probability measure P, we define TP, 
the centroid body o/P, as the centroid body of any random vector distributed according to P. 

The following lemma says that the centroid body is equivariant under linear transformations. 
It is a slight generalization of statements in [22] and [38[ Theorem 9.1.3]. 

Lemma 12. Let X he a random vector on M”. Let A : M”' —>• M” he an invertible linear transfor¬ 
mation. Then r(AA) = A(rA). 
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Proof. X G r(AX) <;=> 'iu{x,u) < E(|(tt,^X)|) 'iu{x,u) < E(|(A^tt,X)|) <;=> 'iv{x,A '^v) < 
E(|(^;,X)|) ^yv{A-^x,v) <E(|(^;,X)|) ^ G r(X) 4^ x G ^r(X). □ 


Lemma 13 ([38l Section 0.8], [39l Remark 1.7.7]). Let K C M"- be a convex body with support 
funetion hx and sueh that the origin is in the interior of K. Then K° is a eonvex body and has 
radial function rK°{u) = l/hK{u) for u G S^~^. 

Lemma [13] implies that testing membership in K° is a one-dimensional problem if we have access 
to the support function hK{-)- We can decide if a point 2 : is in K° by testing if || 2 ;|| < l/hK{z/\\z\\) 
instead of needing to check {z,u) < 1 for all u€ K. In our application K = LX. We can estimate 
hvx{z/\\z\\) by taking the empirical average of \{x^'^\ 2:/||z||)| where the x^®) are samples of X. This 
leads to an approximate oracle for (rX)° which will suffice for our application. The details are in 
Sec. [5l 


4 Mean estimation using 1 + 7 moments 


We will need to estimate the support function of the centroid body in various directions. To this 
end we need to estimate the first absolute moment of the projection to a direction. Our assumption 
that each component Si has finite (1 -|- yj-moment will allow us to do this with a reasonable small 
probability of error. This is done via the following Chebyshev-type inequality. 

Let X be a real-valued symmetric random variable such that E|X|^"*'''' < M for some M > 1 
and 0 < 7 < 1. Then we will prove that the empirical average of the expectation of X converges 
to the expectation of X. 

Let Ejv[|X|] be the empirical average obtained from N independent samples ..., X^^\ i.e., 
(|XW| + --- + |XW)|)/X. 

Lemma 14. Let e G (0,1). With the notation above, for N > we have 

Pr(|IE«[|X|l-E[|X|l| >£|<-^. 

Proof. Let T > 1 be a threshold whose precise value we will choose later. We have 


By the union bound, 


Pr[|X| >T]< 


E[|X|i+T'] 

7 ^ 1+7 


M 

2"l+7 ■ 


/ ■\ i V IVl 

Pr[3z E [A^] such that \ > T] < . 


(4) 


Define a new random variable Xt by 


Xj' 


X if |x| < r, 

0 otherwise. 


Using the symmetry of Xj- we have 

var[|XT|] < E[X|] < TEflXrl] < TE[|X|] < (5) 
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By the Chebyshev inequality and ([5]) we get 


r,' r, n r, n, Var[|XT|l rMV(l+7) 

Pr[|E^[|Xr|] -E[lXr|]| > e'] < < 


Arg/2 


iVe'2 


Putting ([H and dH) together for 0 < e' < 1/2 we get 


NM 

Pr[|£;^[|X|]-E[|XT|]|>e']<;^ + 


J’l+7 


A^e'2 


Choosing 


N := 


jiI+ 7/2 


'M7/(2(1+7)) ’ 


( 6 ) 

(7) 

( 8 ) 


the RHS of the previous equation becomes - 'T’he choice of N is made to minimize 

the RHS; we ignore integrality issues. Pick Tq > 0 so that |E[|X|] — E[|Xto|]| < e'. To estimate Tq, 
note that 


M = E[\X\^+^]>T^E[\\X\-\XtM 


Hence 


|E[|X|] - E[|Xto|]| < E[||X| - IXtoII] < M/T^ 


(9) 


We want M/Tq < e' which is equivalent to Tq > (|f We set Tq := Then, for T >Tq 

putting together ([7]) and @ gives 


Pr[|E^[|X|]-E[|X|]|>26']< 


2M^“'^/(2(1+7)) 

e'r7/2 


Setting e = 26' (so that e G (0,1)) and expressing the RHS of the last equation in terms of N via 
(j8|) (and eliminating T), and using our assumptions 7 , e G (0,1), M > 1 to get a simpler upper 
bound, we get 


Pr[|E^[|X|]-E[|X|]|>6]< 


2 ( 1 + 7 )+ 2 ( 2 +/){ 1 + 7 ) 

1 I ~l 7 
g^^2+7 _/V 2+7 


< 


8M 

(2 jV7/3 ■ 


Condition T > Tq, when expressed in terms of N via ([8]), becomes 


N > 


3 + 1 

22^7 1,1_7_ 


£2^7 


2 ( 1 + 7 ) > 


^8M^^+7 


□ 


11 
















5 Membership oracle for the centroid body 

In this section we provide an efficient weak membership oracle (Subroutine [2]) for the centroid body 
rX of the r.v. X. This is done by first providing a weak membership oracle (Subroutine [T]) for the 
polar body (rX)°. We begin with a lemma that shows that under certain general conditions the 
centroid body is “well-rounded.” This property will prove useful in the membership tests. 

Lemma 15. Let S = (5i,... ,Sn) G K” he an absolutely symmetrically distributed random vector 
such that E(|5j|) = 1 for all i. Then C TS C [— 1 , 1 ]"'. Moreover, C (r5)° C 

Proof. The support function of TS* is hrs{d) = IE|(5,0)| (Def. [H]). Then, for each canonical 
vector Ci, hrsisi) = hrs{—ei) = ElS'jl = 1. Thus, TS is contained in [—1,1]"’. Moreover, since 
[—1,1]” C yLUBf;, we get TS C 

We claim now that each canonical vector e* is contained in TS*. To see why, first note that since 
the support function is 1 along each canonical direction, TS will touch the facets of the unit hyper¬ 
cube. Say, there is a point ( 1 , 3 : 2 ,X 3 ,... ,Xn) that touches facet associated to canonical vector ei. 
But the symmetry of the SiS implies that TS is absolutely symmetric, so that (1, ±X 2 , ± 2 : 3 ,..., ±Xn) 
is also in the centroid body. Convexity implies that (1,0,... ,0) = ei is in the centroid body. The 
same argument applied to all ± canonical vectors implies that they are all contained in the centroid 
body, and this with convexity implies that the centroid body contains B^. In particular, it contains 

□ 

5.1 Membership oracle for the polar of the centroid body 

As mentioned before, our membership oracle for (rX)° (Subroutine [T]) is based on the fact that 
^/hrx is the radial function of (TX)”, and that hrx is the directional absolute first moment of X, 
which can be efficiently estimated by sampling. 


Subroutine 1 Weak Membership Oracle for (TX)” 

Input: Query point y G Q”, samples from symmetric ICA model X = AS, bounds sm > <7max(d), 
Sm < CT min fA), closeness parameter e, failure probability <5. 

Output: Weak membership decision for y G (TA)”. 

1 : Generate iid samples x^^\x^'^\ ..., x^^'l of X for N = poly..^(n, M, 1/sm, sm, I/ej 1/<5). 

2 : Compute 


N 




2 = 1 


3: If ||y|| < l//i, report y as feasible. Otherwise, report y as infeasible. 


Lemma 16 (Correctness of Subroutine [T|). Let 7 > 0 6 e a constant and X = AS be given by a 
symmetric ICA model such that for all i we have E(|Sj|^'’''’') < M < 00 and normalized so that 
E|Si| = 1. Let €,5 > 0. Given sm > U max (A) , Sm < er min (A). Suhroutine\^ is an {e,6)-weak mem¬ 
bership oracle for (TA)® with using time and sample complexity poly..^(n, M, 1/sm, sm, 1/e, 1/5). 
The degree of the polynomial is 0(1/7). 


12 





Proof. Recall from Def. [5] that we need to show that, with probability at least 1 — 5, Subroutined] 
outputs TRUE when y G S'((rX)°,e) and FALSE when y 0 5((rX)°,—e); otherwise, the output 
can be either TRUE or FALSE arbitrarily. 

Fix a point y G Q"' and let 9 := y/||y|| denote the direction of y. The algorithm estimates the 
radial function of (TAi)” along 0, which is l//irx(^) (see Lemma [T3I) . In the following computation, 
we simplify the notation by using h = hYx{9). R is enough to show that with probability at least 
1 — 5 the algorithm’s estimate, l//i, of the radial function is within e of the true value, 1/h. 

For ..., i.i.d. copies of A, the empirical estimator for h is h := ^ ^)l- 

We want to apply Lemma dH to (A, 9). For this we need a bound on its (1 + 7 )-moment. The 
following simple bound is sufficient for our purposes: Let u = A'^9. Then \ui\ < (Tmax(A) for all i. 
Then 


E|(A,0)|^+^ = E|(5,'«)|^+^ = < ^E|5, 


u, 


11+7 


2=1 


2=1 


( 10 ) 




|l+7 |„.ll+7 
1 I I 

2=1 2=1 

Lemma dH implies that, for ei > 0 to be fixed later, and for 


N > 


\ 4^ 


( 11 ) 


we have P{\h — h\ > ei) < 5. 

Lemmas dsi and da give that C TA for r := Sm !< crmin(^)/\Ai'- R follows that h > r. 
If \h — h\ < ei and ei < r/2, then we have 


1 1 

h~h 


\h - h\ ^ ei ^ ^ 
hh ~ r(r - ei) ~ ’ 


which in turn gives, when ei = min{r^e/2, r/2}, 


P 


1 

h 


1 

h 


<e] > P 


1 

h 



> P{\h-h\ < ei) > 1 


5. 


Plugging in the value of ei and, in turn of r, into (fTT]l gives that it suffices to take 


N > poly..^(n,M, l/smjSMjl/e, 1/5). 


□ 


5.2 Membership oracle for the centroid body 

We now describe how the weak membership oracle for the centroid body TA is constructed using 
the weak membership oracle for (TA)”, provided by Subroutine dl 

We will use the following notation: For a convex body K G M”', e, 5>0, R>r>0 such that 
rB 2 Q K Q RB 2 , oracle WMEMi^(e,5,R,r) is an (e,5)-weak membership oracle for K. Similarly, 
oracle WVAL/f (e, 5, R, r) is an (e, 5)-weak validity oracle. Lemma fTCl along with the equivariance of 
r (LemmadSj) gives {sm/y/n)B 2 C TA C (smV^)^ 2 - Then 1/{y/nsM)B 2 ^ (TA)® C {^fn/Sm)B 2 . 
Set r := \l{^\fnsM) and R := ^Jnjsm). 

Detailed description of Subroutine [B There are two main steps: 
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1. Use Subroutine [U to create an (e 2 , 5 )-weak membership oracle (e 2 ,5, i?, r) for 

(rX)°. Theorem 4.3.2 of [23] (stated as Theorem [ 8 ] here) is used in Lemma [TT] to get an 
algorithm to implement an (ei,(5)-weak validity oracle WVAL^pjjf^o (ei, <5, i?, r) running in ora¬ 
cle polynomial time; WVAL(rx)°(^i,5,.R,r) invokes S/Q, R,r) a polynomial 

number of times, specihcally Q = poly(n, log R) (see proof of Lemma 1171) . The proof of 
Theorem 4.3.2 can be modified so that 62 > l/poly(l/ei,R, 1/r). 

2. Lemma 4.4.1 of [23| (stated as Lemma [9] here) gives an algorithm to construct an (e, 5)- 
weak membership oracle WMEI\/lrx(e) l/^j 1/R) from WVAL(px)o (ei, 5, R, r). The proof of 
Lemma 4.4.1 in [23] shows WI\/IEMrx(e, <5,1/r, 1/R) calls WVAL^pj^^o (ei, (5, R, r) once, with 
ei > 1 /poly(l/e, ||y||, l/r) (where y is the query point). 


Subroutine 2 Weak Membership Oracle for TX 

Input: Query point x € M"", samples from symmetric ICA model X = AS, bounds sm > < 7 max(A), 
Sm < < 7 min(A), closeness parameter e, failure probability 6, access to a weak membership oracle 
for {FXy. 

Output: (e, 5)-weak membership decision for x G TX. 

1: Construct WVAL(pj!f)o(ei, (5, R,r) by invoking WMEM(px)o(e 25 < 5 /Q 5 Rj^)- (See Step [D in the 
detailed description.) 

2 : Construct WMEMpx(e, <5, l/r, 1/R) by invoking WVAL(px)o(ei, 5, R,r). (See Step[ 2 ]in the de¬ 
tailed description.) 

3: Return the output of running WMEMpx(e, (5, l/r, 1 /R) on x. 


Lemma 17 (Correctness of Subroutine [2]) . Let X = AS be given by a symmetric ICA model such 
that for all i we have IE(IRjl^“’"''') < M < 00 and normalized so that ElSjl = 1. Then, given a 

query point x G W^, 0 < e < <5 > 0, sm > a -m^A A), and Sm < <7mm(A), Subroutine\^ is an 

e-weak membership oracle for x and TX with probability 1 — 6 using time and sample complexity 
poly (n,M, 1/s 

mt SM, 1/e, l/d). 

Proof. We first prove that WVAL(px)°(ei, <5, R, r) (abbreviated to WVAL(px)o hereafter) works 
correctly. To this end we need to show that for any given input, WVAL(px)o acts as an ei- 

weak validity oracle with probability at least 1 — (i. Oracle WVAL(px)° makes Q queries to 

WI\/IEM(px)o(e 2 ,(^/Q,R, r). If the answer to all these queries were correct then Theorem 4.3.2 
from [23] would apply and would give that WVAL(pj)f)o outputs an answer as expected. Since these 
Q queries are adaptive we cannot directly apply the union bound to say that the probability of all 
of them being correct is at least 1 — Q{6/Q) = 1 — 5. However, a more careful bound allows us to 
do essentially that. 

Let Qi,... ,qk he the sequence of queries, where qi depends on the result of the previous queries. 
For i = 1,... ,k, let Bi be the event that the answer to query q^ by Subroutine (D is not correct 
according to the definition of the oracle it implements. These events are over the randomness of 
Subroutine [D and event Bi involves the randomness of qi,... ,qi, as the queries could be adaptively 
chosen. By the union bound, the probability that all answers are correct is at least 1 — h’r{Bi). 
It is enough to show that Pr(Rj) < 6/Q. To see this, we can condition on the randomness associated 
to ( 71 ,... qi-i. That makes qi deterministic, and the probability of failure is now just the probability 
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that Subroutine [U fails. More precisely, Pr(i?j | gi,... < 5/Q, so that 

Pr(Si) = j Pi{Bi I qi,... ,qi_i)Pr{qi,... ,qi_i) dqi,... ,dqi_i < 6/Q. 

This proves that the first step works correctly. Correctness of the second step follows directly 
because the algorithm for construction of the oracle involves a single call to the input oracle as 
mentioned in Step [ 2 ] of the detailed description. 

Finally, to prove that the running time of Subroutine [ 2 ] is as claimed the main thing to note 
is that, as mentioned in Step [T] of the detailed description, 62 is polynomially small in ei and ei is 
polynomially small in e and so €2 is polynomially small in e. □ 

6 Orthogonalization via the uniform distribution in the centroid 
body 

The following lemma says that linear equivariance allows orthogonalization: 

Lemma 18. Let U be a family of n-dimensional product distributions. Let U be the closure of 
U under invertible linear transformations. Let Q(P) be an n-dimensional distribution defined as a 
function ofFGU. Assume that U and Q satisfy: 

1. For all F G Lf, (5(P) is absolutely symmetric. 

2. Q is linear equivariant (that is, for any invertible linear transformation T we have Q(TF) = 
TQ{F)). 

3. For any F gU , Cov((5(P)) is positive definite. 

Then for any symmetric LCA model X = AS with Fs G U we have Cov((5(Px))~^^^ is an orthogo- 
nalizer of X. 

Proof. Consider a symmetric ICA model X = AS with P 5 G U. Assumptions [T] and [3] imply 
D := Coy{Q{Fs)) is diagonal and positive definite. This with Assumption [2] gives Cov(Q(Pj\:)) = 
Cov(Ag(P 5 )) = ADA'^ = AD^/^{AD^/^f. Let B = Cov(Q(Px))"P2 (the unique symmetric 
positive dehnite square root). We have B = RD~^^^A~^ for some unitary matrix R (see @01 Pg 
406]). Thus, BA = RD~^^^ has orthogonal columns, that is, it is an orthogonalizer for X. □ 

The following lemma applies the previous lemma to the special case when the distribution Q(P) 
is the uniform distribution on TP. 

Lemma 19. Let X be a random vector drawn from a symmetric ICA model X = AS such that 
for all i we have 0 < EjS’jj < 00 . Let Y be uniformly random in TX. Then Cov(y)“^/^ is an 
orthogonalizer of X. 

Proof. We will use Lemma [THl After a scaling of each Si, we can assume without loss of generality 
that E(5'j) = 1. This will allow us to use LemmalEl Let U = {Pw ^ Pw is an absolutely symmetric 
product distribution and E|lFj| = 1, for all i}. For P £ 17, let g(P) be the uniform distribution 
on the centroid body of P. For all Prv £ U, the symmetry of the Wfs implies that FPvi/, that 
is, Q{Fw), is absolutely symmetric. By the equivariance of F (from Lemma [12]) and Lemma [15] 
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it follows that Q is linear equivariant. Let ¥ € U. Then there exist A and ¥w £ U such that 
P = A¥w- So we get Cov(Q(P)) = Cov(^Q(Pvy)) = ^Cov(rPvi/)^^. From Lemma fT5] we know 
Bn ^ FPru so that Cov(rPn/) is a diagonal matrix with positive diagonal entries. This implies that 
Cov((5(P)) is positive definite and thus by Lemma fTHl Cov(y)“^/^ is an orthogonalizer oi X. □ 


Algorithm 1 Orthogonalization via the uniform distribution in the centroid body 
Input: Samples from symmetric ICA model X = AS, bounds sm > crmax(^); Sm < < 7111111(^)5 error 
parameters e and 5, access to an (e, 5)-weak membership oracle for TA provided by Subroutine[2l 
Output: A matrix B which orthogonalizes the independent components of X. 

1: Let S be an estimate of Cov(rA) obtained via Theorem 1101 sampling algorithm such as the one 
in m with Cc = e/(2(n + 1)^), r = Sml\pn-, R = SMy/n, and same 6. 

2: Return B = 


Theorem 20 (Correctness of algorithm [T]). Let X = AS be given by a symmetric ICA model 
such that for all i we have E(|5i|^^'^) < M < 00 and normalized so that E|5j| = 1. Then, given 
0 < e < , (5 > 0, sm > <7max(^) 1 Sm < CT min fA) , Algorithm [1] outputs a matrix B so that 

WA^B'^BA — D \\2 < e, for a diagonal matrix D with diagonal entries di,... ,dn satisfying l/(n + 
1)^ < di < 1. with probability at least 1 — 6 using poly..^(re, M, 1/sm, sm, I/ej 1/<^) time and sample 
complexity. 

Proof. Prom Lemma [15] we know B^ C TiS C [—1,1]”. Using the equivariance of T (Lemma I12jl . 
we get arain{A)/^/nB2 C TX C (Tniax(^)\Aj'.B2 • Thus, to satisfy the roundness condition of Theo¬ 
rem [10] we can take r := Sm.l\pn < (J m\A AAj\fn. R := SMy/n > ama.x{A)\/n. 

Let S be the estimate of S := Cov(rA) computed by the algorithm. Let A := be the 

estimate of A := GOVTS' obtained from S according to how covariance matrices transform under 
invertible linear transformations of the underlying random vector. As in the proof of Lemma [TBl we 
have S = AKA^ and B = RA~^^‘^A~^ for some unitary matrix R. Thus, we have A^B^BA = A~^. 
It is natural then to set D := A“^ = Cov(rS)“^. Let di,. .. ,dn be the diagonal entries of D. We 
have, using Lemma [U 

WA'^B^BA - D\\2 = ||A-i - A-i||2 

— ||A”M| ||A (A-A)||2 , . 

As in ([5]), we show that ||A“^(A — A )||2 is small: 

We first bound (di), the diagonal entries of D = A“^. Let dmax := maxj di and d rr^■^u := minj di. 
We hnd simple estimates of these quantities: We have d^in = I/IIAII 2 and ||A ||2 is the maximum 
variance of TS along coordinate axes. From Lemma[T5]we know FS C [—1,1]"", so that ||A ||2 < 1 
and dmin A 1- Similarly, dmax = l/'7min(A), where iTmin(A) is the smallest diagonal entry of A. In 
other words, it is the minimum variance of FS* along coordinate axes. From Lemma [15] we know 
FS' D Bf, so that FS* 2 [—ei,ei] for all i and Lemma ET] below implies (Tmin(A) > l/(n-|- 1)^. That 
is, <imax < (n + 1)^. 

From the bounds on di, Theorem [TO] and the fact discussed after it, we have 
||A-i(A - A)|| < d^ax||A||ee < (n + ife^ < 1/2, 
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when €c < l/( 2 (n + 1 )^). 

This in (I12p with Theorem 1101 again gives 


A^B^BA-D\\ < 2||A-i||||A-i(A-A)|| < 2||A-if ||A - A|| <2dl 


'max 


edlA|| < 2 (n + l)'^e, 


The claim follows by setting ec = e/(2(n + 1)'^). The sample and time complexity of the algorithm 
comes from the calls to Subroutine [2j The number of calls is given by Theorem 1101 This leads to 


the complexity as claimed. 


□ 


Lemma 21. Let K C M"' be an absolutely symmetric convex body such that K contains the seg¬ 
ment [— 61 ,ei] = convjei,—ei} (where ei is the first canonical vector). Let X = {Xi,...,Xn) be 
uniformly random in K. Then var(Xi) > l/(n + 1)^. 

Proof. Let H be a diagonal linear transformation so that DK isotropic. Let du be the first entry 
of H. It is known that any n-dimensional isotropic convex body is contained in the ball of radius 
n + 1 mmM Theorem 4.1]. Note that DK contains the segment [—dnei,dncij. This implies 
dll < (n + 1). Also, by isotropy we have, 1 = var((iiiAi) = dh var(Ai). The claim follows. □ 

7 Gaussian damping 

In this section we give an efficient algorithm for the heavy-tailed ICA problem when the ICA matrix 
is a unitary matrix; no assumptions on the existence of moments of the Si will be required. 

The basic idea behind our algorithm is simple and intuitive: using X we construct another ICA 
model Xji = ASr, where i? > 0 is a parameter which will be chosen later. The components of Sr 
have light-tailed distributions; in particular, all moments exist. We show how to generate samples 
of Xr efficiently using samples of X. Using the new ICA model, the matrix A can be estimated by 
applying existing ICA algorithms. 

For a random variable Z we will denote its the probability density function by pz{')- The 
density of Xr is obtained by multiplying the density of A by a Gaussian damping factor. More 


precisely. 



Define 



then 


PXnix) = -r^Px{x)e 


We will now find the density of Sr. Note that if x is a value of Xr, and s = A is the 
corresponding value of Sr, then we have 
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where we used that A is a unitary matrix so that ||^s|| = ||s||. Also, px{x) = ps{s) follows from 
the change of variable formula and the fact that |det A| = 1. We also used crucially the fact that 
the Gaussian distribution is spherically-symmetric. We have now specified the new ICA model 
Xu = ASr, and what remains is to show how to generate samples of Xr. 

Rejection sampling. Given access to samples from px we will use rejection sampling (see e.g. 
[Robert-Gasella] ) to generate samples from pxr- 

1. Generate x ^ px- 

2. Generate 2 : ~ U[0, 1]. 

3. If 2 : G output X-, else, go to the first step. 


The probability of outputting a sample with a single trial in the above algorithm is Kx^- Thus, 
the expected number of trials in the above algorithm for generating a sample is YjKxj^- 

We now choose R. There are two properties that we want R sufficiently large so as to satisfy: 
(1) Kxj^ > C\ and |cum 4 S'yij| > Ijn^^ where Ci G (0,1/2) and C 2 > Q are constants. Such a 
choice of R exists and can be made efficiently; we outline this after the statement of Theorem [2l 
Thus, the expected number of trials in rejection sampling before generating a sample is bounded 
above by 1/C\. The lower bound on Kx^ will also be useful in bounding the moments of the Sj^R^ 
where Sj^R is the random variable obtained by Gaussian damping of Sj with parameter R, that is 
to say 

Define 

■= [ Psjisj)e~"V^^dsj 

Jr 

and let Ks_j r be the product of Ksf.^ over A: G [re] \ {j}. By s-j G we denote the vector 

s G M” with its jth element removed, then notice that 


^Xr = Ksr = A'si.flATsj ^ ... Ks„^r, 

Kxr = Ks^^j^Ks_^^s- 

We can express the densities of individual components of Sr as follows: 


PSi.Ri^j) 


psR{s)ds-j = 


-7^ [ Ps{s)e ^ PSj{sj)e . 

J^Xr Jm"-i Rxr 


This allows us to derive bounds on the moments of Sj r: 


(13) 

(14) 



We now state Theorem 4.2 from [7] in a special case by setting parameters k and ki in that 
theorem to 4 for i G [re]. The algorithm analyzed in Theorem 4.2 of [7] is called Fourier PGA. 
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Theorem 22. Let X E M” be given by an ICA model X = AS where A E is unitary and 

the Si are mutually independent, < M 4 for some positive eonstant M 4 , and |cum 4 ( 5 i)| > A. 

For any e > 0 with probability at least I — S, Fourier PC A will reeover veetors { 61 ,... , 6 „} sueh 
that there exist signs a* = ±1 and a permutation vr : [n] —>■ [n] satisfying 

\\Ai — ^7 

using poly(n, M 4 ,1/A, 1/e, 1/5) samples. The running time of the algorithm is also of the same 
form. 

Combining the above theorem with Gaussian damping gives the following theorem. As previ¬ 
ously noted, since we are doing rejection sampling in Gaussian damping, the expected number of 
trials to generate N samples of Xu is N/Kx^ ■ One can similarly prove high probability guarantees 
for the number of trials needed to generate N samples. 

Theorem 2. Let X = AS be an ICA model such that A E is unitary (i.e., A^A = I) and 

the distribution of S is absolutely continuous. Let A > 0 be such that for each i E [n] if Si has 
finite fourth moment then |cum 4 (jS'j)| > A. Then, given e, 5 > 0, Caussian damping combined with 
Fourier PCA outputs 61 ,..., E M” such that there are signs ai E {—1,1} and a permutation 
TT : [re] —)• [re] satisfying \\Ai — < e, in poly(re, i?, 1/A, 1/e, 1/5) time and sample complexity 

and with probability at least 1 — 5. Here R is a parameter of the distributions of the Si as described 
above. 

We remark that the choice of R in the above theorem can be made algorithmically in an efficient 
way. Theorem 1231 below shows that as we increase R the cumulant cum 4 (S'j^ij) goes to inhnty. This 
shows that for any A > 0 there exists R so as to satisfy the condition of the above theorem, 
namely |cum 4 ( 5 i^re)| ^ We now briefly indicate how such an R can be found efficiently (same 
sample and computational costs as in Theorem [2] above): For a given R, we can certainly estimate 

Kxfi = Jjgn l^dx from samples, i.e. by the empirical mean of 

samples ... , This allows us to search for R so that Kx^ is as large as we want. This also 
gives us an upper bound on the fourth moment via Eq. (1151) . To ensure that the fourth cumulants 
of all Si^R are large, note that for a E M” we have cum 4 (aiS'i_re+- • ■+anSn,R) = Z]i6[n] ofcum 4 (S'j^re)- 
We can estimate this quantity empirically, and minimize over a on the unit sphere (the minimization 
can be done, e.g., using the algorithm in [5]). This would give an estimate of min^gj^] cum 4 ( 5 i^re) 
and allows us to search for an appropriate R. 

For the algorithm to be efficient, we also need Kxji > Ci. This is easily achieved as we can 
empirically estimate Kx^ using the number of trials required in rejection sampling, and search for 
sufficiently large R that makes the estimate sufficiently larger than Ci. 

8 The fourth cumulant of Gaussian damping of heavy-tailed dis¬ 
tributions 

It is clear that if r.v. X is such that E(A'^) = 00 and E(X^) < 00 , then cum 4 (Are) = E(A^) — 
3(E(X^))^ —)■ 00 as i? —)■ 00 . However, it does not seem clear when we have E(A^) = 00 as well. 
We will show that in this case we also get cum 4 (Aij) —)■ 00 as i? —)■ 00 . 

We will confine our discussion to symmetric random variables for simplicity of exposition; for 
the purpose of our application of the theorem this is w.l.o.g. by the argument in Sec. [9l 
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Theorem 23. Let X be a symmetric real-valued random variable with E(X^) = oo. Then cum 4 (Xij) —)■ 
oo as R ^ oo. 


Proof. Fix a symmetric r.v. X with = oo; as previously noted, if EX^ < oo then the theorem 
is easily seen to be true. Since X is symmetric and we will be interested in the fourth cumulant, 
we can restrict our attention to the positive part of X. So in the following we will actually assume 
that X is a positive random variable. Fix C > 100 to be any large positive constant. Fix a small 
positive constant ei < Also fix another small positive constant €2 < 1/10. Then there exists 

a > 0 such that 


Pr[X > a] = px{x)dx < ei. 


(16) 


Let 7712 (^ 2 ) := x^px{x)e Recall that p(x)dx = E e Note 

that if R > a (which we assume in the sequel), then 


1 > Rxr > 


1 - ei 


(17) 


Since m 2 (7?) —)• 00 as 7? — 00 , by choosing R sufficiently large we can ensure that 

^00 /•OO 

/ x^pxix)e~^^ dx'^iX — eq) / a?px{x)e~^'^dx. (18) 

J a Jo 


Moreover, we choose R to be sufficiently large so that y/Cm 2 {R) > a. Then 

/•OO 

7712 ( 7 ?) = / x^ px{x)e~^^ ^ dx 

Jo 

pa p-\J Crh2{R) poo 

= / x‘^px{x)e~^^^^^dx-\- / x^px{x)e~^^^^^dx + _ x‘^px{x)e~^^^^^dx 

Jo Ja JjCrh 2 (R) 


< £2 


r°° ryfCrafiR) „ „ r°o 

/ x‘^px{x)e-^ dx+ / x^px{x)e-^ dx+ / _ x‘^px{x)e-^ dx 

Jo Ja J^/CihoAR) 


y/Cfh 2 {R) 


poo 

< €27712(7?) + 61(77712(7?) + _ x ^ px { x ) e~^^/^^dx (by (US])) 

Jy/Cfh2(R) 

poo 

= {Cei + € 2 ) 7712 ( 7 ?) + / _ x^px{x)e~'^^/^^dx. 

Jy/Crh 2 {R) 

Summarizing the previous sequence of inequalities: 


ly/Crh2(R) 


x^px{x)e ^ dx > {\ — Ce\ — € 2 ) 7712 ( 7 ?). 


( 19 ) 


Now 


EX^ > Kx, 


poo 

EX^ = / x^px{x)e~^^dx fthe inequality uses (1171) 1 
Jo 


/•C — 

> / _ x^px{x)e~^^^^^ dx 

Jy'C7h2{R) 
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POO 

>Cm2{R) / _ px{x)e~^^dx 

J^/Crh2{R) 

> C'm 2 (-R)(l - Cei - € 2 ) m 2 (-R) (by ([T9])) 

= C{l-Cei-e2)m2{Rf 

= C{l-Cei-e2){Kx^^Xlf 

> C(1 - Cei - € 2 ) (EX2)2 (by (HID). 

Now note that C{l — Cei — € 2 ) > 10 for our choice of the parameters. Thus cum 4 (XH) = 

— 3(EX^)^ > 7(EJT^)^, and by our assumption EX^ —)• 00 with R —)■ 00 . □ 


9 Symmetrization 

As usual we work with the ICA model X = AS. Suppose that we have an ICA algorithm that 
works when each of the component random variable Si is symmetric, i.e. its probability density 
function satisfies 4)i{y) = (j)i{—y) for all y, with a polynomial dependence on the upper bound M 4 
on the fourth moment of the Si and inverse polynomial dependence on the lower bound A on the 
fourth cumulants of the Si. Then we show that we also have an algorithm without the symmetry 
assumption and with a similar dependence on M 4 and A. We show that without loss of generality 
we may restrict our attention to symmetric densities, i.e. we can assume that each of the Si has 
density function satisfying (pi{y) = (j)i{—y). To this end, let S'- be an independent copy of Si and 
set Si := Si — 5'. Similarly, let Xi := Xi — X[. Clearly, the Si and Xi have symmetric densities. 
The new random variables still satisfy the ICA model: Xi = ASi. Moreover, the moments and 
cumulants of the Si behave similarly to those of the Sf. For the fourth moment, assuming it exists, 
we have E[5j^] = E[(5i — 5'-)^] < 2^E[5j^]. The inequality above can easily be proved using the 
binomial expansion and Holder’s inequality: 

E[{Si - 5')^] = E[5f] + 4E[5f] E[5'] + 6E[52] E[(5')^] + 4E[5i] E[(5')^] + £[(5^^] < 16E[5f]. 

The final inequality follows from the fact that that for each term in the LHS, e.g. E[5?]E[S'l] we 
have E[5f] E[S'] < E[Sf]y^n{S'n^^ = E[5f]. 

For the fourth cumulant, again assuming its existence, we have cum 4 ( 5 j) = cum 4 ( 5 i — 5') = 
cum4(5i) + cum4(—5i) = 2 cum4(5j). 

Thus if the the fourth cumulant of Si is away from 0 then so is the fourth cumulant of Si. 


10 Putting things together 

In this section we combine the orthogonalization procedure (Algorithm [1] with performance guar¬ 
antees in Theorem [20]) with ICA for unitary A via Gaussian damping to prove our main theorem. 
Theorem [H 

Theorem 1 (Heavy-tailed ICA). Let X = AS be an ICA model such that the distribution of S is 
absolutely continuous, for all i we have E(|iS'j|^'*‘''') < M < 00 and normalized so that Els'll = 1, 
and the columns of A have unit norm. Let A > 0 be such that for each i G [re] if Si has finite 
fourth moment then its fourth cumulant satisfies |cum 4 (S'j)| > A. Then, given 0 < e < re^, d > 0, 
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sm > q' ma v(A), Sm < crmin(A), Algorithm U\ Combined with Gaussian damping and Fourier PCA 
outputs bi,...,bn G M” such that there are signs ai G {—1,1} and a permutation vr : [n] —>• [n] 
satisfying ||^j — || < e, with poly^(n, M, l/s^) sm; 1/A, i?, 1/i?, 1/e, 1/(5) iime and sample 

complexity and with probability at least 1 — 5. Here R is a parameter of the distributions of the Si 
as described below. The degree of the polynomial is 0 ( 1 / 7 ). 

As noted in the introduction, intuitively, R in the theorem statement above measures how large 
a ball we need to restrict the distribution to so that there is at least a constant (or 1 / poly(n) if 
needed) probability mass in it and moreover each Si when restricted to the interval [—has the 
fourth cumulant at least n(A). Formally, i? > 0 is such that dx > p{n) > 0, 

where 1/poly(ri) < p{n) < 1 can be chosen, and for simplicity, we will fix to 1/2. Moreover, R 
satisfies that cum 4 (£'j^/{) > Fl{/S./n^) for all u G S'^~^ and i G [n], where Si^R is the Gaussian 
damping with parameter R oi Si. 

In Sec. Owe saw that the moments and cumulants of the non-symmetric random variable behave 
similarly to those of the symmetric random variable. So by the argument of Sec. Owe assume that 
our ICA model is symmetric. Theorem 1201 shows that Algorithm [1] gives us a new ICA model with 
the ICA matrix having approximately orthogonal columns. We will apply Gaussian damping to 
this new ICA model. In Theorem 1201 it was convenient to use the normalization E|5j| = 1 for all 
i. But for the next step of Gaussian damping we will use a different normalization, namely, the 
columns of the ICA matrix have unit length. This will require us to rescale (in the analysis, not 
algorithmically) the components of S appropriately as we now describe. 

Algorithm [1] provides us with a matrix B such that the columns oi C = BA are approximately 
orthogonal: C^C ~ D where D is a diagonal matrix. Thus, we can rewrite our ICA model as 
Y = CS, where Y = BX. We rescale Cj, the ith column of C, by multiplying it by l/||Cj||. 
Denoting by L the diagonal matrix with the ith diagonal entry l/||Ci||, the matrix obtained after 
the above rescaling of C is CL and we have {CL)'^CL ~ I. We can again rewrite our ICA model 
as y = {CL){L~^S). Setting E := CL and T := L~^S we can rewrite our ICA model as T = ET. 
This is the model we will plug into the Gaussian damping procedure. Had Algorithm [1] provided 
us with perfect orthogonalizer B (so that C^C = D) we would obtain a model Y = ET where E 
is unitary. We do get however that E k, E. To continue with a more standard ICA notation, from 
here on we will write X = ES for Y = ET and X = ES for Y = ET. 

Applying Gaussian damping to A = ES gives us a new ICA model Xr = ESr as we saw in 
Sec. [71 But the model we have access to is A = ES. We will apply Gaussian damping to it to 
get the r.v. Xr. Formally, Xr is defined starting with the model A = ES just as we defined Xr 
starting with the model X = ES (recall that for a random variable Z, we denote its probability 
density function by pz{■))'■ 


Pxr(^) ■= 

where := dx. The parameter R has been chosen so that > Ci := 1/2 

and cum 4 (A/j,tt) > A for all u G By the discussion after Theorem [2] (the restatement in 

Sec. [7]), this choice of R can be made efficiently. (The discussion there is in terms of the directional 
moments of S, but note that the directional moments of A also give us directional moments of S. 
We omit further details.) But now since the matrix E in our ICA model is only approximately 
unitary, after applying Gaussian damping the obtained random variable Xr is not given by an ICA 
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model (in particular, it may not have independent coordinates in any basis), although it is close to 
Xr in a sense to be made precise soon. 

Because of this, Theoreml22lis not directly usable for plugging in the samples of X^. To address 
this discrepancy we will need a robust version of Theorem [22] which also requires us to specify in 
a precise sense that X^ and X^ are close. To this end, we need some standard terminology from 
probability theory. The characteristic function of r.v. X G M"' is defined to be 4>x{u) = E(e*“ ^), 
where u G M"'. The cumulant generating function, also known as the second characteristic function, 
is defined by ipx{u) = log(j)xiu). The algorithm in [7] estimates the second derivative of ipx{u) 
and computes its eigendecomposition. (In [6] and [7], this second derivative is interpreted as a 
kind of covariance matrix of X but with the twist that a certain “Fourier” weight is used in the 
expectation computation for the covariance matrix. We will not use this interpretation here.) 
Set 'kx('u) := D'^ipx{u), the Hessian matrix of ijjx{u). We can now state the robust version of 
Theorem |22l 

Theorem 24. Let X be an n-dimensional random vector given by an ICA model X = AS where 
A G is unitary and the Si are mutually independent, E[S'^] < M 4 and |cum 4 (S'j)| > A for 

positive constants M 4 and A. Also let G [0,1]. Suppose that we have another random variable 
X that is close to X in the following sense: 

|d'^(u) - d'x(«)| < ^ 

for any u G M” with ||u|| < 1. Moreover, E[(A, < M4 for ||u|| < 1. When Fourier PCA is 

given samples of X it will recover vectors { 61 ,... ,&„} such that there exist signs Oi G {—1,1} and 
a permutation tt : [n] —)• [n] satisfying 


in poly(n, M 4 ,1/A, 1/q^ samples and time complexity and with probability at least 1 — 

While this theorem is not stated in [7| , it is easy to derive from their proof of Theorem \2^ we 
now briefly sketch the proof of Theorem [2H indicating the changes one needs to make to the proof 
of Theorem [22] in [7] . 

Proof. Ideally, for input model X = AS with A unitary, algorithm Fourier PCA would proceed 
by diagonalizing '^x{u). But it can only compute an approximation '^^('^i) which is the empirical 
estimate for ^x('^^)- For ah u with ||tt|| < 1 , it is shown that with high probability we have 

ll^x(w) - 4 'x(m)||^ < e. (20) 

Then, a matrix perturbation argument is invoked to show that if the diagonalization procedure used 
in Fourier PCA is applied to 'kx(rt) instead of 'l'x(rr)) one still recovers a good approximation of A. 
This previous step uses a random u chosen from a Gaussian distribution so that the eigenvalues of 
'I'x(u) are sufficiently spaced apart for the eigenvectors to be recoverable (the assumptions on the 
distribution ensure that the requirement of ||tt|| < 1 is satisfied with high probability). The only 
property of ^x{u) used in this argument is (l^nil . To prove Theorem [211 we show that the estimate 
is also good: 

||4'^(u) - d'x(?/)||^ < ll^^(u) - d';f(u)||^ + ||d'^(u) - ^'x(«)||^ < 2e, 


\Ai II ^ 


Ua 
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where we ensured that < e by taking sufficiently many samples of X to get a 

good estimate with probability at least 5; as in [7], a standard concentration argument shows that 
poly(n, M 4 ,1/A, 1/e, 1/(5) samples suffice for this purpose. Thus the diagonalization procedure can 
be applied to T^(n). The upper bound of 2e above translates into error < e in the final 

recovery guarantee, with the extra factor coming from the eigenvalue gaps of D 

To apply Theorem [ 23 ] to our situation, we need 


This will follow from the next lemma. 
Note that 


_ D‘^(t>x{u) _ {Dc/)x{u))'^{D(t>x{u)) 

4>x{u) (t)x{uY 


( 21 ) 


( 22 ) 


(The gradient D(px{u) is a row vector.) Thus, to show (j21j) it suffices to show that each expression 
on the RHS of the previous equation is appropriately close: 

Lemma 25. Let A G [0,1/2], and let E,E € sueh that E is unitary and \\E — E \\2 < A^/3. 

Let Xji and Xji be the random variables obtained by applying Gaussian damping to the ICA models 
X = ES and X = ES, resp. Then, for ||u|| < 1, we have 


(I^Xr{u) - (l)x^{u) 


< RX^/3 + 4A + 


4A 


\\D(I)Xr{u) - D(t>j^^{u)\\ < 0{nXR^), 
D^^xAu) - < 0{n^XR^). 


Proof. We will only prove the first inequality; proofs of the other two are very similar and will be 
omitted. In the second equality in the displayed equations below we use that Px{x)dx 

fj^n ps{s) ds. One way to see this is to think of the two integrals as expectations: 

(^iuTX^-\\xf/R^\ ^ ^ (^iuTES^-\\Esf/R^\^ 


4>Xr{u) - f)xju) 


< 


+ 


Kxr 

1 


/ 

Jw 

^-j 

Xr Jr-' 

j 

jR' 


Kxr 

1 


^iv/-x^—\\x\\ /Ft 


||Es 


px{x)dx — 


K 


Xr 


I 

jR' 




^iu-^ Es^—\\Es\\ /R 


Px{x)dx 
ps{s)ds 


Kxr K 


R Xr 


/^%s{s)ds - — 3 — f 

J<Xr Jr 

Ps{s)ds — f ps{s)ds 

Jr" 

Jr" 


^iu^Es^-WEsf 


<—f 

Kxr Jr' 


^iuTEs^-\\Esf/R^ _ ^iuTEs^-\\Es\f/R^ 


ps{s)ds- 


Kxr K. 


K 


Xr 


G 


H 
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Now 


G = 


I 

Jw 


JuT[E-E)s(\\Esf-\\Es\\ )/i?2 


- 1 


e ps{s)ds. 


We have 


^iuTiE-E)s^{\\Esf-\\Es\\ )/ij2 _ ^ < ^iuT(E-E)s _ ^ ^ g(||Es||2-||£.|| )/R^ _ ^ 


and so 


G < 


I 

JR’ 


^iu^{E-E)s _ ^ 


^-\\Esf/R^ 


ps{s)ds + / 
JR’ 


^{\\Es\\^-\\Es\\ )/R^ _ ^ 


^-\\Esf/R^ 


ps{s)ds. 

(23) 


For the first summand in (1231) note that 

Ju^{E-E)s _ 2 


< 


u^{E - E)s 


This follows from the fact that for real 6 we have 

2 


e*® - 1 


= (cos 6 — ly + sim 9 = 2 — 2 cos 6 = 4sim(0/2) < 


So, 


[ u^{E - E)s e-ll®*ll'/^Vs(s)ds 

JR” 


< llttll \\E — E\\ 


I 

JR’ 


s e 


-||s||V-R^ 


ps{s)ds (using \\Es\\ = ||s|| as E is unitary) 


< ||tt||ll-F — F^IL ( maxze / pg(s)ds 

V / JR” 


< ||m ||\\E — E\\2R 

< RX'^/3, 

where the last inequality used our assumption that ||rt|| < 1 . 

We will next bound second summand in (I23p . Note that \\Es\\ = ||s|| and ||-Es||^ — \\Es\\ < 

(Pll + pIDdl^-^IDIIsf. Since ||^-^|| < X‘^/3 « 1, we get ||^sf - ||^sf < (||F;|| + 
||^||)(||-F — ^||)||s||^ < A^||s|p. We will use that e"^ — 1 < A + for A G [0, 1 / 2 ] which is satisfied 
by our assumption. Now the second summand in (|23p can be bounded as follows. 


[ y\\Esf-\\my/R^ _ ^ 

JR” 

X^\\sf/R^ _ 2 


/ 

JR’ 

/ 

JR’ 




e ps{s)ds 

P-Ildl7fi= 


psis)ds 
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eA^INlVi?^ _ 1 


eA^INlVi?^ _ 1 


J||s||>R/v/A 



J\\s\\<R/VX 

< f (A + ps{s)ds + [ 

■J\\4<R/V\ J\\s\\>R/^ 

< (A + X^)Kx^ + 

< 2XKxn + 2A (using A < 1/2). 


-\\s\?/R^ 


ps{s)ds 


-[l-X^)\\s\\^/R^ 


ps{s)ds 


Combining our estimates gives 


^ i?A2 ^ 2A 
G < — -h 2A + 




Kx, 


Finally, to bound H, note that 


1 


Kx, 


Kx,-K 


x^ 


1 


KXr 

1 

KXr 


-m\Vr^ 


px{x)dx — I e pj^{x)dx 

JR" 

-I|E.||“/K^p^(,)*_ f e-"^‘f/«%s{s)ds 

JR" 


This we just upper-bounded above by 2A -|- 
Thus we have the hnal estimate 


(pXRiu) - <G + H <RX'^/3 + 4:X + 


4A 

Kxr 


The proofs of the other two upper bounds in the lemma follow the same general pattern with 
slight changes. □ 

We are now ready to prove Theorem [TJ 

Proof of Theorem [ij We continue with the context set after the statement of Theorem [TJ The plan 
is to apply Theorem [2T| to and X^. To this end we begin by showing that the premise of 
Theorem 1241 is satisfied. 

Theorem (201 with ^20] = <^/2 and be specified later, provides us with a matrix B such 

that the columns of BA are approximately orthogonal: \\{BA)"^BA — D \\2 < for some diag¬ 
onal matrix D. Now we set E := BAL, where L = diag(Li,... ,T„) = diag(l/\/^, • • •, l/\/^)- 
Theorem 1201 implies that 


1 < Tvj < (n -I- 1). 


(24) 


Then 


\\E^E-I\\2 = \\iBALfiBAL)-I\\2 

= \\L^{BAfBAL - L^DL \\2 < ||L|| 2 ||(.B^)^.B^ - T >||2 <{n + 

because Lj < (n -|- 1) by (|24l) . For E as above, there exists a unitary E such that 

\\E - E\\2^ + (25) 
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by Lemma [26] below. 

By our choice of R, the components of Sr satisfy cum 4 (S'j^ij) > A and < 2R^ (via 

(fT^ and our choice Ci = 1/2). Hence M 4 < 2R‘^. The latter bound via (1^ gives E[(A, < 

2(1 + (n + 1 )^^ 20 |)-^^ ^ ^ §”“^. 

Finally, Lemma [25] with (f2^ and simple estimates give < 0{n^R!^^^. 

We are now ready to apply Theorem [2H with = 0{n^R‘^^^) and = (5/2. This gives 
that Fourier PC A produces output 64 ,... , 6 ^ such that there are signs = ±1 and permutation 
vr : [n] ^ [n] such that 



with poly(n, 1/A,ii, 1/ii, 1/^^ 1/5) sample and time complexity. Choose so that the RHS of 
(126[) is e. 

The number of samples and time needed for orthogonalization is poly..^ (n, M, 1/sm, 

Substituting the value of ^^the previous bound becomes poly..y(n, M, 1/Smi Sm, 1/A, i?, 1/i?, 1/e, 1/(5). 
The probability of error, coming from the applications of Theorem 1201 and 1241 is at most 5/2-\-5/2 = 

(5. □ 

Lemma 26. Let E £ be such that —/II 2 < e. Then there exists a unitary matrix 

E £ such that \\E — EW^ < e. 

Proof. This is related to a special case of the so-called orthogonal Procrustes problem [331 Section 
12.4.1], where one looks for a unitary matrix E that minimizes ||Fi — E\\p. A formula for an optimal 
E is E = UV'^, where U'EV'^ is the singular value decomposition of E, with singular values {(Xi). 
Although we do not need the fact that this E minimizes ||Fi — E\\p, it is good for our purpose: 

||.F - F ;||2 = \\UT,V^ - C/P ^||2 = ||s - /II 2 = max|o-i - 1|. 

i 

By our assumption 

E — /II2 = — /II2 = ||S^ — /II2 = maxlu/ — 1 | = max((Ti -|- l)|(Tj — 1 | < e. 

i i 

This implies max/ui — 1| < e. The claim follows. □ 
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